-
Notifications
You must be signed in to change notification settings - Fork 8.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Resilient saved object migration algorithm #78413
Conversation
}; | ||
const control = state.controlState; | ||
if (control === 'INIT') { | ||
return delay(Actions.fetchAliases(client, ['.kibana', '.kibana_current', '.kibana_7.11.0'])); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: the action is tightly coupled with the state. Is it possible to collocate them closer?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall, I'm very excited about this FSM-based approach. This will make it much easier for us to test very specific states in isolation. I took a very similar approach with the reindexing mechanism in the Upgrade Assistant and it was much easier to reason about.
If we add detailed debug logging for the state & actions, understanding failures in production will also be much eaiser for our support team. We should definitely add detailed documentation on what these states mean and how to diagnose problems (but make clear that these states are not 'stable' so that we can change them between versions, if necessary)
67bcbe1
to
6ad7a54
Compare
The new saved-object migrations algorithm relies on the fact that the migrations defined by the plugins are deterministic. Do we have any controls in place to ensure that all current and future migrations are deterministic? |
7d7db7c
to
8d011b5
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall, the code coverage is way better now.
Still got some nits/comments, but nothing that significant.
if (log.type === 'transition') { | ||
logger.info(logMessagePrefix + `${log.prevControlState} -> ${log.controlState}`, log.state); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
NIT: logger.info(
${logMessagePrefix} ${log.prevControlState} -> ${log.controlState}, log.state);
(multiple occurences)
src/core/server/saved_objects/migrationsv2/migrations_state_action_machine.ts
Outdated
Show resolved
Hide resolved
}); | ||
|
||
const countUntilThree = countUntilModel(3); | ||
const finalStateP = stateActionMachine(state, next, countUntilThree); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
NIT: this can throw. I would at least move it to a beforeAll
or beforeEach
block
pipe( | ||
TaskEither.tryCatch( | ||
() => transformRawDocs(state.outdatedDocuments), | ||
(e) => { | ||
throw e; | ||
} | ||
), | ||
TaskEither.chain((docs) => | ||
Actions.bulkOverwriteTransformedDocuments(client, state.targetIndex, docs) | ||
) | ||
), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
NIT: should probably be 'extracted' to the action creator
if (state.controlState === 'DONE' || state.controlState === 'FATAL') { | ||
// Return null if we're in one of the terminating states | ||
return null; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ultraNIT: could introduce either
isTerminal(controlState)
terminalStates
array +terminalStates.includes(state.controlState)
src/core/server/saved_objects/migrations/kibana/kibana_migrator.ts
Outdated
Show resolved
Hide resolved
src/core/server/saved_objects/migrationsv2/actions/catch_retryable_es_client_errors.test.ts
Outdated
Show resolved
Hide resolved
const outdatedDocumentsQuery = { | ||
bool: { | ||
should: Object.entries(migrationVersionPerType).map(([type, latestVersion]) => ({ | ||
bool: { | ||
must: { term: { type } }, | ||
must_not: { term: { [`migrationVersion.${type}`]: latestVersion } }, | ||
}, | ||
})), | ||
}, | ||
}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This feels a little strange that this is put in the state, but OTOH not doing it would force to put migrationVersionPerType
in it instead, so not sure which one is the best
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah @restrry had a good idea to create a KibanaModel
that could contain all the state and logic that's static, things we don't collect as part of the algorithm but rather that's injected into it at the start. We could still attach the KibanaModel onto the state, but it could for instance be a class and wouldn't have to be a POJO like the rest of the state.
Didn't get around to it though :(
if (stateP.controlState === 'INIT') { | ||
const res = resW as ExcludeRetryableEsError<ResponseType<typeof stateP.controlState>>; | ||
|
||
if (Either.isRight(res)) { | ||
const indices = res.right; | ||
const aliases = getAliases(indices); | ||
|
||
if ( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The file would still need to be split into per-state files.
@elasticmachine merge upstream |
I'm merging this so that we can get a snapshot build for QA to test with and so that we can backport it to 7.x (7.12) which will make it a lot simpler to test with 6.x data. |
Looks like this PR has a backport PR but it still hasn't been merged. Please merge it ASAP to keep the branches relatively in sync. |
💚 Build SucceededMetrics [docs]Distributable file count
History
To update your PR or re-run it, just comment with: |
Looks like this PR has a backport PR but it still hasn't been merged. Please merge it ASAP to keep the branches relatively in sync. |
* Resilient saved object migration algorithm (#78413) * Initial structure of migration state-action machine * Fix type import * Retries with exponential back off * Use discriminated union for state type * Either type for actions * Test exponential retries * TaskEither types for actions * Fetch indices instead of aliases so we can collect all index state in one request * Log document id if transform fails * WIP: Legacy pre-migrations * UPDATE_TARGET_MAPPINGS * WIP OUTDATED_DOCUMENTS_TRANSFORM * Narrow res types depending on control state * OUTDATED_DOCUMENTS_TRANSFORM * Use .kibana instead of .kibana_current * rename control states TARGET_DOCUMENTS* -> OUTDATED_DOCUMENTS* * WIP MARK_VERSION_INDEX_READY * Fix and expand INIT -> * transition tests * Add alias/index name helper functions * Add feature flag for enabling v2 migrations * split state_action_machine, reindex legacy indices * Don't use a scroll search for migrating outdated documents * model: test control state progressions * Action integration tests * Fix existing tests and type errors * snapshot_in_progress_exception can only happen when closing/deleting an index * Retry steps up to 10 times * Update api.md documentation files * Further actions integration tests * Action unit tests * Fix actions integration tests * Rename actions to be more domain-specific * Apply suggestions from code review Co-authored-by: Josh Dover <me@joshdover.com> * Review feedback: polish and flesh out inline comments * Fix unhandled rejections in actions unit tests * model: only delay retryable_es_client_error, reset for other left responses * Actions unit tests * More inline comments * Actions: Group index settings under 'index' key * bulkIndex -> bulkOverwriteTransformedDocuments to be more domain specific * state_action_machine tests, fix and add additional tests * Action integration tests: updateAndPickupMappings, searchForOutdatedDocuments * oops: uncomment commented out code * actions integration tests: rejection for createIndex * update state properties: clearer names, mark all as readonly * add state properties currentAlias, versionAlias, legacyIndex and test for invalid version scheme in index names * Use CONSTANTS for constants :D * Actions: Clarify behaviour and impact of acknowledged: false responses * Use consistent vocabulary for action responses * KibanaMigrator test for migrationsV2 * KibanaMigrator test for FATAL state and action exceptions in v2 migrations * Fix ts error in test * Refactor: split index file up into a file per model, next, types * next: use partial application so we don't generate a nextActionMap on every call * move logic from index.ts to migrations_state_action_machine.ts and test * add test * use `Root` to allow specifying oss mode * Add fix and todo tests for reindexing with preMigrationScript * Dump execution log of state transitions and responses if we hit FATAL * add 7.3 xpack tests * add 100k test data * Reindex instead of cloning for migrations * Skip 100k x-pack integration test * MARK_VERSION_INDEX_READY_CONFLICT for dealing with different versions migrating in parallel * Track elapsed time * Fix tests * Model: make exhaustiveness checks more explicit * actions integration tests: add additional tests from CR * migrations_state_action_machine fix flaky test * Fix flaky integration test * Reserve FATAL termination only for situations which we never can recover from such as later version already migrated the index * Handle incompatible_mapping_exception caused by another instance * Cleanup logging * Fix/stabilize integration tests * Add REINDEX_SOURCE_TO_TARGET_VERIFY step * Strip tests archives of */.DS_Store and __MAC_OSX * Task manager migrations: remove invalid kibana property when converting legacy indices * Add disabled mappings for removed field in map saved object type * verifyReindex action: use count API * REINDEX_BLOCK_* to prevent lost deletes (needs tests) * Split out 100k docs integration test so that it has it's own kibana process * REINDEX_BLOCK_* action tests * REINDEX_BLOCK_* model tests * Include original error message when migration_state_machine throws * Address some CR nits * Fix TS errors * Fix bugs * Reindex then clone to prevent lost deletes * Fix tests Co-authored-by: Josh Dover <me@joshdover.com> Co-authored-by: pgayvallet <pierre.gayvallet@elastic.co> Co-authored-by: Kibana Machine <42973632+kibanamachine@users.noreply.github.com> # Conflicts: # rfcs/text/0013_saved_object_migrations.md * createRootWithCorePlugins support for cliArgs which wasn't backported to 7.x * Attempt to stabilize cloneIndex integration tests (#86123) * Attempt to stabilize cloneIndex integration tests * Unskip test * return resolves/rejects and add assertions counts to each test * Await don't return expect promises * Await don't return expect promises for other tests too * Remove 8.0.0 integration test
Summary
Implements: #75780
Current status
Checklist
Delete any items that are not applicable to this PR.
For maintainers